Data can be found here: https://github.com/irecsys/CARSKit/tree/master/context-aware_data_sets
The dataset was originally used in the CARSKit project (described in the paper https://arxiv.org/pdf/1511.03780.pdf), but we'll try other approaches using well-known open-source machine learning frameworks.
# Purpose: to understand data better
#
# How?: Exploratory analysis with simple feature engineering
#
# Steps:
# ------
# 1) Load and clean the data
# 2) Join the data to produce one data frame
# 3) Do basic feature engineering
# a) count basic metrics per user
# b) use domain knowledge (to introduce more information)
# c) hot-encode categorical variables
# d) transform variables, if necessary
# 4) Pair-plot numeric features
# 5) Make the raw and aggregated datasets separately (the latter grouped by the (user, song) pair)
# 6) Try to segment users
# 7) Do dimensionality reduction
# 8) Make a starter model
# 9) Export data
#
# Goals: a) to clean and engineer the data so it can be understood by most ML/DL tools
# b) to suggest which features we should focus on most
#
# Oskar Jarczyk (oskar.jarczyk@pja.edu.pl)
import itertools
import numpy as np
import pandas as pd
import pickle
import pandas_profiling
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
from apyori import apriori
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
filename = 'Data_InCarMusic.xlsx'
sheetname = 'ContextualRating'
contextual_rating = pd.read_excel(filename, sheet_name=sheetname)
print(f'Successfully read {len(contextual_rating):,} rows from the "{sheetname}" Excel sheet')
print(f'Columns found: {contextual_rating.columns.tolist()}')
sheetname = 'Music Track'
music_track = pd.read_excel(filename, sheet_name=sheetname)
print(f'Successfully read {len(music_track):,} rows from the "{sheetname}" Excel sheet')
print(f'Columns found: {music_track.columns.tolist()}')
sheetname = 'Music Category'
music_category = pd.read_excel(filename, sheet_name=sheetname, header=None)
print(f'Successfully read {len(music_category):,} rows from the "{sheetname}" Excel sheet')
print(f'Columns found: {music_category.columns.tolist()} (temporary column names)')
contextual_rating.columns.to_series().groupby(contextual_rating.dtypes).groups
Most of the columns are of string type, except for the ID columns ("UserID", "ItemID") and the song rating, which are numeric.
contextual_rating.dtypes
Column names have whitespace in them; let's remove it using the str.strip() string method.
contextual_rating.columns = [x.strip() for x in contextual_rating.columns.tolist()]
music_track.columns.to_series().groupby(music_track.dtypes).groups
Again, some column names have whitespace in them; let's strip them.
music_track.columns = [x.strip() for x in music_track.columns.tolist()]
music_track.head()
Some columns are not interesting or valuable at all. For example, the description attribute is always empty, and the reference to (probably) the album cover is invalid. The name of the mp3 file doesn't hold any information either. On the other hand, we could use the (artist, title) pair to fetch lyrics (and their sentiment), but that's out of scope for this exploratory analysis.
music_track = music_track[['id', 'artist', 'title', 'category_id']].rename(columns={'id': 'ItemID'})
music_track.dtypes
# TODO: use Levenshtein distance to check for possible duplicates
most_common_artist = music_track.groupby('artist')['artist'].count()
most_common_artist.sort_values(ascending=False).head(12)
music_category.columns = ['category_id', 'category name']
music_category.dtypes
contextual_rating.iloc[3].to_dict()
music_track.iloc[10].to_dict()
data = contextual_rating.merge(music_track, suffixes=('_ctx', '_track'), on='ItemID', how='inner')
if len(data) == len(contextual_rating):
    print('Success! Every music track is findable in the "music_track" dataframe')
else:
    raise AssertionError("Some tracks are missing.")
data = data.merge(music_category, on='category_id', how='inner')
if len(data) == len(contextual_rating):
    print('Success! Every music track has a proper music category assigned')
else:
    raise AssertionError("Some categories are missing.")
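The same sanity check can be expressed with pandas' own merge diagnostics instead of a length comparison; a minimal sketch on toy frames (the frame contents here are illustrative, not taken from the dataset):

```python
import pandas as pd

ratings = pd.DataFrame({'ItemID': [1, 1, 2, 3], 'Rating': [5, 4, 3, 2]})
tracks = pd.DataFrame({'ItemID': [1, 2, 3], 'title': ['a', 'b', 'c']})

# validate='m:1' raises if ItemID is not unique on the right side;
# indicator=True adds a '_merge' column telling where each row came from
merged = ratings.merge(tracks, on='ItemID', how='left',
                       validate='m:1', indicator=True)
assert (merged['_merge'] == 'both').all(), 'Some tracks are missing.'
print(f'All {len(merged)} rating rows matched a track')
```

Unlike the inner-join-plus-length check, a left join with `indicator` also shows exactly which rows failed to match.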
counts = data.groupby('UserID')['UserID'].count()
data['number_of_songs'] = data['UserID'].map(counts)
counts.hist(bins=20)
counts = data.groupby('UserID')['ItemID'].nunique()
data['number_of_unique_songs'] = data['UserID'].map(counts)
counts.hist(bins=20)
counts = data.groupby('UserID')['category_id'].nunique()
data['number_of_unique_genres'] = data['UserID'].map(counts)
counts.hist(bins=10)
counts = data.groupby(['UserID', 'category name'])['category name'].count()
genre_counts = counts.groupby(level=0).apply(lambda x: x / float(x.sum()))
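The per-user normalization above can be checked on a toy frame; an equivalent, version-stable way to get the same ratios uses `transform` instead of `apply` (the toy data below is made up):

```python
import pandas as pd

toy = pd.DataFrame({'UserID': [1, 1, 1, 2, 2],
                    'genre': ['rock', 'rock', 'pop', 'jazz', 'jazz']})
counts = toy.groupby(['UserID', 'genre'])['genre'].count()
# divide each (user, genre) count by that user's total listen count
ratios = counts / counts.groupby(level=0).transform('sum')
print(ratios)
```

User 1 listened to rock twice out of three listens, so `ratios[(1, 'rock')]` is 2/3; user 2 only listened to jazz, so their ratio is 1.0.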
def second_dominant(c):
    # map each ratio value to the genre name(s) sharing that ratio
    d, _max = {}, c.max()
    for e, v in c.items():
        if v in d:
            d[v] += ', ' + e
        else:
            d[v] = e
    # drop the top value, then take the genre(s) with the next-highest ratio
    del d[_max]
    try:
        return d[max(d.keys())]
    except ValueError:  # the user listens to only one genre
        return None
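A related one-liner for "second most frequent" uses `Series.nlargest`; note it handles ties differently than `second_dominant()` above, which concatenates tied genre names (the ratios below are made up):

```python
import pandas as pd

# one user's genre ratios, like a single entry of genre_counts
ratios = pd.Series({'rock': 0.5, 'pop': 0.3, 'jazz': 0.2})
# take the two largest ratios; the last of them is the second dominant genre
second = ratios.nlargest(2).index[-1]
print(second)  # -> 'pop'
```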
data['dominant_genre'] = data.apply(lambda x: genre_counts[x['UserID']].idxmax(), axis=1)
data['second_dominant_genre'] = data.apply(lambda x: second_dominant(genre_counts[x['UserID']]), axis=1)
data['genre_ratio'] = data.apply(lambda x: genre_counts[x['UserID']][x['category name']], axis=1)
data['genre_ratio'].hist()
# assign ratio of the most dominant music genre
data['main_genre_dominance'] = data.apply(lambda x: genre_counts[x['UserID']].max(), axis=1)
data['main_genre_dominance'].hist(bins=20)
I believe there is no need to create an implicit attribute characterizing music preferences, as it is deducible from the whole set of attributes per se. We already have information on the most dominant and second most dominant genre, we know the dominance ratio, and there is an attribute for the total number of genres the user has listened to.
# Let's plot it and see if there are any correlations
calm_conditions = {'landscape': ['country side', ],
'mood': ['lazy', ],
'driving style': ['relaxed driving', ],
'natural phenomena': ['afternoon', 'night'],
'sleepiness': ['sleepy', ],
'traffic conditions': ['free road', ],
'weather': ['cloudy', ]}
def score_calm_conditions(row):
    # one point for every context column whose value matches a calm condition
    columns = {'landscape': 'landscape', 'mood': 'mood',
               'DrivingStyle': 'driving style',
               'naturalphenomena': 'natural phenomena',
               'sleepiness': 'sleepiness',
               'trafficConditions': 'traffic conditions',
               'weather': 'weather'}
    return float(sum(1 for col, key in columns.items()
                     if row[col] is not None and row[col] in calm_conditions[key]))
stimulation_conditions = {'landscape': ['mountains/hills', ],
'mood': ['active', ],
'driving style': ['sport driving', ],
'natural phenomena': ['day time', 'morning'],
'sleepiness': ['awake', ],
'traffic conditions': ['lots of cars', 'traffic jam', ],
'weather': ['rainy', 'snowing', ]}
def score_stimulative_conditions(row):
    # one point for every context column whose value matches a stimulative condition
    columns = {'landscape': 'landscape', 'mood': 'mood',
               'DrivingStyle': 'driving style',
               'naturalphenomena': 'natural phenomena',
               'sleepiness': 'sleepiness',
               'trafficConditions': 'traffic conditions',
               'weather': 'weather'}
    return float(sum(1 for col, key in columns.items()
                     if row[col] is not None and row[col] in stimulation_conditions[key]))
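A quick sanity check of the scoring idea on a hand-made context row; the condition lists and column names below are deliberately tiny stand-ins for the real dicts, so the snippet runs standalone:

```python
# made-up condition lists, mirroring the structure of calm_conditions
calm = {'mood': ['lazy'], 'weather': ['cloudy'], 'sleepiness': ['sleepy']}

def score(row, conditions):
    # one point per context column whose value is in the condition list
    return float(sum(1 for col, allowed in conditions.items()
                     if row.get(col) in allowed))

row = {'mood': 'lazy', 'weather': 'sunny', 'sleepiness': 'sleepy'}
print(score(row, calm))  # 2.0 (mood and sleepiness match, weather does not)
```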
data['no_stimulus_points'] = data.apply(score_calm_conditions, axis=1)
data['stimulus_points'] = data.apply(score_stimulative_conditions, axis=1)
data.drop('category_id', axis=1, inplace=True)
data.columns
UserID - ID which uniquely identifies a person driving & listening to track
ItemID - track ID which uniquely identifies a song
Rating - a 5-star rating given to a track
DrivingStyle - what is the style of driving (while actually listening to the song)?
Answers questions: a) Are you driving in a relaxed mood? b) Is your driving style very sportive?
landscape - what type of landscape is behind window?
Answers questions: a) Are you driving along a coast line? b) Are you on a country side? c) Are you driving between mountains or hills? d) Are you driving in a city?
mood - current mood of the driver
Answers questions: a) Are you feeling very active? b) Are you happy? c) Are you lazy now? d) Do you feel sad?
naturalphenomena - what time of day is it?
Answers questions: a) Is it afternoon? b) Is it day time? c) Is it morning? d) Are you driving at night?
RoadType - what type of road the driver is currently traveling through
Answers questions: a) Are you driving in a city street? b) Are you driving on a highway? c) Are you driving through serpentine curves?
columns_of_interest = ['Rating', 'number_of_songs', 'stimulus_points', 'main_genre_dominance',
'number_of_unique_songs', 'number_of_unique_genres', 'category name']
sns.pairplot(data=data[columns_of_interest], hue="category name", dropna=True)
data.columns
data_encoded = pd.get_dummies(data.drop(['artist', 'title'], axis=1))
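What `pd.get_dummies` does to a categorical column, shown on a toy frame (values made up): each distinct value becomes its own 0/1 indicator column, named `<column>_<value>`.

```python
import pandas as pd

toy = pd.DataFrame({'mood': ['happy', 'lazy', 'happy']})
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())  # ['mood_happy', 'mood_lazy']
```

Numeric columns pass through unchanged, which is why we only had to drop the free-text 'artist' and 'title' columns above.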
'; '.join(data_encoded.columns.tolist())
data_aggr = pysqldf('''select UserID, ItemID, avg(Rating) as avg_rating,
avg(number_of_unique_songs) as number_of_unique_songs,
avg(number_of_unique_genres) as number_of_unique_genres,
avg(genre_ratio) as genre_ratio,
avg(main_genre_dominance) as main_genre_dominance,
sum(no_stimulus_points) as no_stimulus_points,
sum(stimulus_points) as stimulus_points,
sum(`DrivingStyle_relaxed driving`) as driving_style_relaxed_driving,
sum(`DrivingStyle_sport driving`) as driving_style_sport_driving,
sum(`landscape_coast line`) as landscape_coast_line,
sum(`landscape_country side`) as landscape_country_side,
sum(landscape_mountains) as landscape_mountains,
sum(landscape_urban) as landscape_urban,
sum(mood_active) as mood_active,
sum(mood_happy) as mood_happy,
sum(mood_lazy) as mood_lazy,
sum(mood_sad) as mood_sad,
sum(naturalphenomena_afternoon) as natural_phenomena_afternoon,
sum(`naturalphenomena_day time`) as natural_phenomena_day_time,
sum(naturalphenomena_morning) as natural_phenomena_morning,
sum(naturalphenomena_night) as natural_phenomena_night,
sum(RoadType_city) as road_type_city,
sum(RoadType_highway) as road_type_highway,
sum(RoadType_serpentine) as road_type_serpentine,
sum(sleepiness_awake) as sleepiness_awake,
sum(sleepiness_sleepy) as sleepiness_sleepy,
sum(`trafficConditions_free road`) as traffic_conditions_free_road,
sum(`trafficConditions_lots of cars`) as traffic_conditions_lots_of_cars,
sum(`trafficConditions_traffic jam`) as traffic_conditions_traffic_jam,
sum(weather_cloudy) as weather_cloudy,
sum(weather_rainy) as weather_rainy,
sum(weather_snowing) as weather_snowing,
sum(weather_sunny) as weather_sunny,
avg(`category name_Blues music`) as category_name_blues,
avg(`category name_Classical music`) as category_name_classical,
avg(`category name_Country music`) as category_name_country,
avg(`category name_Disco music`) as category_name_disco,
avg(`category name_Hip Hop music`) as category_name_hip_hop,
avg(`category name_Jazz music`) as category_name_jazz,
avg(`category name_Metal music`) as category_name_metal,
avg(`category name_Pop music`) as category_name_pop,
avg(`category name_Reggae music`) as category_name_reggae,
avg(`category name_Rock music`) as category_name_rock,
avg(`dominant_genre_Blues music`) as dominant_genre_blues,
avg(`dominant_genre_Pop music`) as dominant_genre_pop,
avg(`dominant_genre_Rock music`) as dominant_genre_rock,
avg(`second_dominant_genre_Blues music`) as second_dominant_blues,
avg(`second_dominant_genre_Blues music, Classical music, Disco music`) as second_dominant_blues_classical_disco,
avg(`second_dominant_genre_Blues music, Classical music, Hip Hop music`) as second_dominant_blues_classical_hh,
avg(`second_dominant_genre_Blues music, Disco music, Rock music`) as second_dominant_blues_disco_rock,
avg(`second_dominant_genre_Blues music, Hip Hop music`) as second_dominant_blues_hh,
avg(`second_dominant_genre_Blues music, Metal music, Reggae music`) as second_dominant_blues_metal_reggae,
avg(`second_dominant_genre_Classical music`) as second_dominant_classical,
avg(`second_dominant_genre_Classical music, Country music`) as second_dominant_classical_country,
avg(`second_dominant_genre_Classical music, Country music, Disco music, Hip Hop music`) as second_dominant_classical_country_disco_hh,
avg(`second_dominant_genre_Classical music, Country music, Disco music, Hip Hop music, Jazz music, Metal music, Rock music`) as second_dominant_classical_country_disco_hh_jazz_metal_rock,
avg(`second_dominant_genre_Classical music, Disco music`) as second_dominant_classical_disco,
avg(`second_dominant_genre_Classical music, Disco music, Reggae music`) as second_dominant_classical_disco_reggae,
avg(`second_dominant_genre_Classical music, Hip Hop music, Rock music`) as second_dominant_classical_hh_rock,
avg(`second_dominant_genre_Country music`) as second_dominant_country,
avg(`second_dominant_genre_Country music, Disco music, Rock music`) as second_dominant_country_disco_rock,
avg(`second_dominant_genre_Country music, Jazz music, Rock music`) as second_dominant_country_jazz_rock,
avg(`second_dominant_genre_Disco music`) as second_dominant_disco,
avg(`second_dominant_genre_Disco music, Hip Hop music`) as second_dominant_disco_hh,
avg(`second_dominant_genre_Jazz music`) as second_dominant_jazz,
avg(`second_dominant_genre_Metal music`) as second_dominant_metal
from data_encoded
group by UserID, ItemID;''')
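The same aggregation can be written in pandas alone, avoiding the pandasql round-trip through SQLite; a sketch for a couple of the columns (toy data, column names mirroring the real ones), since the full query maps one-to-one onto a named-aggregation dict:

```python
import pandas as pd

toy = pd.DataFrame({'UserID': [1, 1, 2], 'ItemID': [10, 10, 11],
                    'Rating': [4, 2, 5], 'stimulus_points': [1.0, 2.0, 0.0]})
# named aggregation: new_column=(source_column, aggregation_function)
aggr = (toy.groupby(['UserID', 'ItemID'], as_index=False)
           .agg(avg_rating=('Rating', 'mean'),
                stimulus_points=('stimulus_points', 'sum')))
print(aggr)
```

Columns with spaces or commas in their names need no backtick-quoting this way, which is a practical advantage over the SQL route for this dataset.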
data_aggr.describe()
report = data.profile_report(style={'full_width':True})
report
f'Rejected variables due to high correlation with some other variables: {report.get_rejected_variables()}'
report_aggr = data_aggr.profile_report(style={'full_width':True})
f'Rejected variables due to high correlation with some other variables: {report_aggr.get_rejected_variables()}'
excluded = []
excluded_aggr = []
# id-columns
excluded.extend(['UserID', 'ItemID'])
excluded_aggr.extend(['UserID', 'ItemID'])
# correlated
excluded.extend(['number_of_unique_songs', ])
excluded_aggr.extend(['genre_ratio', ])
print(f'Excluded columns for normal dataset: {excluded},\n aggregated dataset: {excluded_aggr}')
mms, mms_aggr = MinMaxScaler(), MinMaxScaler()
data_transformed = mms.fit_transform(data_encoded.drop(excluded, axis=1))
data_transformed_aggr = mms_aggr.fit_transform(data_aggr.drop(excluded_aggr, axis=1))
sum_of_squared_distances = []
sum_of_squared_distances_aggr = []
k_range = range(1, 30)
for k in k_range:
    km = MiniBatchKMeans(n_clusters=k).fit(data_transformed)
    sum_of_squared_distances.append(km.inertia_)
for k in k_range:
    km = MiniBatchKMeans(n_clusters=k).fit(data_transformed_aggr)
    sum_of_squared_distances_aggr.append(km.inertia_)
plt.figure(figsize=(20, 6))
plt.subplot(1, 2, 1)
plt.plot(k_range, sum_of_squared_distances_aggr, 'bx-')
plt.xlabel('value of [k]')
plt.ylabel('Sum of squared_distances')
plt.title('Elbow method for optimal value of [k] - Aggregated Dataset')
plt.subplot(1, 2, 2)
plt.plot(k_range, sum_of_squared_distances, 'bx-')
plt.xlabel('value of [k]')
plt.ylabel('Sum of squared_distances')
plt.title('Elbow method for optimal value of [k] - "Normal" dataset')
plt.show()
We used the "elbow method" to find an optimal number of clusters for the KMeans algorithm; that number is around k=10.
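The elbow is often ambiguous to read off a plot; the silhouette score is a complementary criterion (values closer to 1 mean tighter, better-separated clusters). A sketch on synthetic blobs rather than our data:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic data with 4 well-separated clusters
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
for k in (2, 4, 8):
    labels = MiniBatchKMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```

On real, noisier data the silhouette curve is flatter, but picking the k that maximizes it is a useful cross-check against the elbow reading.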
km = MiniBatchKMeans(n_clusters=10)
km = km.fit(data_transformed)
km_aggr = MiniBatchKMeans(n_clusters=10)
km_aggr = km_aggr.fit(data_transformed_aggr)
labels = km.predict(data_transformed)
labels_aggr = km_aggr.predict(data_transformed_aggr)
data_labeled = data.copy()
data_aggr_labeled = data_aggr.copy()
data_labeled['segment'] = labels.tolist()
data_aggr_labeled['segment'] = labels_aggr.tolist()
data_aggr_labeled['category_name'] = data_aggr.apply(
    lambda x: data[(data.UserID == x.UserID) &
                   (data.ItemID == x.ItemID)].iloc[0]['category name'],
    axis=1)
fig = px.scatter(data_aggr_labeled, x="segment",
y="avg_rating", color="category_name",
size="number_of_unique_songs",
hover_data=['stimulus_points', ])
fig.show()
comp_analysis = pd.DataFrame()
comp_analysis['label'] = data_aggr_labeled['segment']
# scaler = MinMaxScaler()
# X_imputed = scaler.fit_transform(data_aggr.drop(['UserID', 'ItemID'], axis=1))
# data_for_pca = pd.DataFrame(X_imputed, columns = data_aggr.drop(['UserID', 'ItemID'], axis=1).columns)
# scaling lowers explained variance
data_for_pca = data_aggr.drop(['UserID', 'ItemID', 'genre_ratio'], axis=1)
pca = PCA(n_components=3) # n_components - number of components to keep
pca_result = pca.fit_transform(data_for_pca.values)
comp_analysis['pca-one'] = pca_result[:,0]
comp_analysis['pca-two'] = pca_result[:,1]
comp_analysis['pca-three'] = pca_result[:,2]
variance = pca.explained_variance_ratio_
print(f'Explained variance: {variance}')
sv = pca.singular_values_
print(f'Singular values: {sv}')
pcv = pca.explained_variance_ratio_
'Total explained variance by {} components is equal to {}'.format(len(pcv), sum(pcv))
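Instead of fixing n_components at 3, PCA also accepts a float, keeping the fewest components that reach a target cumulative explained variance. A sketch on random data (assumption: any numeric matrix behaves the same way):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.normal(size=(200, 10))
# float n_components = keep the smallest number of components whose
# cumulative explained-variance ratio reaches 95%
pca = PCA(n_components=0.95)
pca.fit(X)
print(pca.n_components_, round(pca.explained_variance_ratio_.sum(), 3))
```

For visualization we are stuck with 3 components regardless, but the threshold form is handy when the reduced data feeds a downstream model.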
fig = plt.figure(figsize=(13, 10))
ax = fig.add_subplot(111, projection= '3d')
ax.scatter(comp_analysis[comp_analysis.label.isin([1,6,2])]['pca-one'],
comp_analysis[comp_analysis.label.isin([1,6,2])]['pca-two'],
comp_analysis[comp_analysis.label.isin([1,6,2])]['pca-three'], s=15, color='b')
ax.scatter(comp_analysis[comp_analysis.label.isin([9,4])]['pca-one'],
comp_analysis[comp_analysis.label.isin([9,4])]['pca-two'],
comp_analysis[comp_analysis.label.isin([9,4])]['pca-three'], s=30, color='g')
ax.scatter(comp_analysis[comp_analysis.label == 3]['pca-one'],
comp_analysis[comp_analysis.label == 3]['pca-two'],
comp_analysis[comp_analysis.label == 3]['pca-three'], s=40, color='y')
ax.scatter(comp_analysis[comp_analysis.label == 8]['pca-one'],
comp_analysis[comp_analysis.label == 8]['pca-two'],
comp_analysis[comp_analysis.label == 8]['pca-three'], s=40, color='r')
ax.set_xlabel('pca-one')
ax.set_ylabel('pca-two')
ax.set_zlabel('pca-three')
t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. different initializations can give different results.
n_components - dimension of the embedded space.
perplexity - is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity.
early_exaggeration - controls how tight natural clusters in the original space are in the embedded space and how much space will be between them. For larger values, the space between natural clusters will be larger in the embedded space.
tsne_input = data_aggr.drop(['UserID', 'ItemID', 'genre_ratio'], axis=1)
tsne = TSNE(n_components=3,
verbose=1, random_state=42,
perplexity=28.0,
early_exaggeration=12.0,
n_iter=500)
tsne_results = tsne.fit_transform(tsne_input.values)
df_tsne = pd.DataFrame()
df_tsne['label'] = data_aggr_labeled['segment']
df_tsne['x-tsne'] = tsne_results[:,0]
df_tsne['y-tsne'] = tsne_results[:,1]
df_tsne['z-tsne'] = tsne_results[:,2]
# filter out anomalies
df_tsne = df_tsne[df_tsne['z-tsne'] < 1]
fig = plt.figure(figsize=(13, 10))
ax = fig.add_subplot(111, projection= '3d')
ax.scatter(df_tsne[df_tsne.label.isin([1,6,2])]['x-tsne'],
df_tsne[df_tsne.label.isin([1,6,2])]['y-tsne'],
df_tsne[df_tsne.label.isin([1,6,2])]['z-tsne'], s=15, color='b')
ax.scatter(df_tsne[df_tsne.label.isin([9,4])]['x-tsne'],
df_tsne[df_tsne.label.isin([9,4])]['y-tsne'],
df_tsne[df_tsne.label.isin([9,4])]['z-tsne'], s=30, color='g')
ax.scatter(df_tsne[df_tsne.label == 3]['x-tsne'],
df_tsne[df_tsne.label == 3]['y-tsne'],
df_tsne[df_tsne.label == 3]['z-tsne'], s=40, color='y')
ax.scatter(df_tsne[df_tsne.label == 8]['x-tsne'],
df_tsne[df_tsne.label == 8]['y-tsne'],
df_tsne[df_tsne.label == 8]['z-tsne'], s=40, color='r')
ax.set_xlabel('x-tsne')
ax.set_ylabel('y-tsne')
ax.set_zlabel('z-tsne')
Association rule mining is a technique to identify underlying relations between different items. Normally, it is used to predict which items will be bought together. Take an example of a super market, where customers can buy variety of items. Usually, there is a pattern in what the customers buy. We can apply similar technique to predict which songs will be listened to together during a single drive.
songs = []
for group_name, group in data_aggr.groupby('UserID'):
    items = []
    for row_index, row in group.iterrows():
        items.append(music_track[music_track.ItemID == row['ItemID']].iloc[0].title)
    songs.append(items)
There are three major components of the Apriori algorithm:
Support - how frequently an itemset appears in the transactions
Confidence - how often the rule holds among transactions that contain its left-hand side
Lift - the ratio of observed support to the support expected if the two sides were independent
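These three metrics can be computed by hand on a toy set of "drives" (each a list of songs played during one trip), which makes the apriori output easier to read; the song names are made up:

```python
drives = [['songA', 'songB'], ['songA', 'songB', 'songC'],
          ['songA'], ['songB'], ['songA', 'songB']]

def support(itemset):
    # fraction of drives that contain every song in the itemset
    return sum(set(itemset) <= set(d) for d in drives) / len(drives)

# metrics for the rule songA -> songB
sup_a, sup_b = support(['songA']), support(['songB'])
sup_ab = support(['songA', 'songB'])
confidence = sup_ab / sup_a   # P(songB | songA)
lift = confidence / sup_b     # boost over songB's baseline frequency
print(sup_ab, confidence, lift)
```

Here a lift below 1 would mean songA actually makes songB *less* likely than its baseline, which is why we filter rules by a minimum lift below.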
results = apriori(songs, min_confidence=0.75, min_lift=5, min_length=2)
The minimum lift of 5 tells us that the right-hand song is 5.0 times more likely to be listened to by users who listened to the left-hand song, compared to the right-hand song's baseline likelihood.
apriori_results = itertools.islice(results, 5)
association_rules = list(apriori_results)
for item in association_rules:
    # the first element of the record holds the items forming the rule
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])
    # the second element is the support of the itemset
    print("Support: " + str(item[1]))
    # confidence and lift live in the ordered statistics
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")
Saving the hot-encoded, context-aware dataset to a comma-separated file, which can later be read by pandas in a different notebook.
data_encoded.to_csv('dataset.csv', sep=",", index=False, header=True)
Saving hot-encoded, aggregated dataset to a comma-separated file.
data_aggr.to_csv('dataset_aggr.csv', sep=",", index=False, header=True)
We also pickle the whole datasets to binary pickle files.
pickle.dump(data_encoded, file=open("dataset.pickle", 'wb'))
pickle.dump(data_aggr, file=open("dataset_aggr.pickle", 'wb'))
# song - category name mapping
data[['ItemID', 'category name']].drop_duplicates().to_csv('music_type.csv', sep=",", index=False, header=True)